Inverted and mirror repeats in model nucleotide sequences 
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We analytically and numerically study the probabilistic properties of inverted and mirror repeats 
in model sequences of nucleic acids. We consider both perfect and non-perfect repeats, i.e. repeats 
with mismatches and gaps. The considered sequence models are independent identically distributed 
(i.i.d.) sequences, Markov processes and long range sequences. We show that the number of repeats 
in correlated sequences is significantly larger than in i.i.d. sequences and that this discrepancy 
increases exponentially with the repeat length for long range sequences. 
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I. INTRODUCTION 

The complete sequencing of large genomes has lead to 
reconsider the importance of non coding DNA or RNA 
in the regulation of the activity of the cell [l[ . Many dif- 
ferent types of sequences able to have a regulatory role 
have been discovered. Among these sequences inverted 
and mirror repeats play an important role. For exam- 
ple inverted repeats provide the necessary condition for 
the potential existence of a hairpin structure in the tran- 
scribed messenger RNA and/or cruciform structures in 
DNA 0]. Inverted repeats play also an important role 
for regulation of transcription and translation. In bacte- 
ria, inverted repeats and the associated hairpin structures 
are oftenpart of rho-independent transcription termina- 
tors [3j recent years there has been a growing 
interest for these structures triggered by the discovery 
of new classes of regulatory elements. Prominent exam- 
ples of these new regulatory RNA families are microRNA 
(miRNA) [i, H, 0] and small interference RNA (siRNA) 
[S', 9] . Most of these structures share the property of be- 
ing associated with an hairpin secondary structure. DNA 
or RNA short sequences that may be associated to RNA 
secondary structures are present in genomes of different 
species of phages, viruses, bacteria and eukaryotes. Indi- 
cation about the potential existence of RNA secondary 
structures can be inferred throughout the detection of 
short pair sequences having the characteristic of inverted 
repeats in the investigated genomes 
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mirror repeats may have multiple biological roles. For 
example, perfect or near-perfect homopurine or homopy- 
rimidine mirror repeats can adopt triple-helical H con- 
formations [31 ■ Several computer programs have been 
developed to detect repeats and/or the associated sec- 
ondary structure in DNA or RNA sequences [l^, [l6j . 
Few studies have considered the problem of the expected 
number of repeats in model sequences (itI ITsj . mainly 
investigating the clustering of repeats. 

The purpose of this paper is to derive analytical an 
numerical expressions for the expected number of two 
specific, yet very important, type of repeats under the as- 
sumption that the investigated sequence can be modeled 
with a given family of stochastic process. In this paper 



we consider inverted and mirror repeats and we investi- 
gate four different types of sequence models. Specifically, 
we consider independent and identically distributed se- 
quences, first order Markov chains, higher order Markov 
processes, and long memory sequences. For the first two 
types of models we are able to derive analytically expres- 
sions for the number of repeats, while for the last two 
classes of models we use numerical simulations to infer 
phenomenological expressions for the expected number 
of repeats. 

The outline of the paper is the following. In Section 

II we introduce the investigated repeats and in Section 

III we introduce the sequence models discussed in the 
papers. In Section IV we consider independent and iden- 
tically distributed sequences and we derive several ana- 
lytical expressions for repeats. In Section V we consider 
first order Markov chains and in Section VI we compute 
numerically the expected number of repeats for higher 
order Markov processes. In Section VII we consider long 
memory sequences and Section VIII concludes. 



II. INVERTED AND MIRROR REPEATS 

In this paper we consider two types of repeats, i.e. in- 
verted and mirror repeats. These repeats are composed 
by two non-overlapping segments of nucleotide sequence 
that can be separated by another nucleotide subsequence. 
A mirror repeat is for example 5'GATTCGAacgAGCT- 
TAG3' where the sequence GATTCGA is repeated in 
an inverted way after the spacer acg. An inverted re- 
peats is for example given by the sequence 5'GATTC- 
GAacgTCGAATCS' where the sequence GATTCGA is 
repeated and complemented after the spacer acg. One 
of the problem in counting repeats is the fact that a sin- 
gle repeat can be counted many times if one does not 
define in some way a maximal repeat. Consider for ex- 
ample the sequence 5'aggaatcgatcttaacgaagatcgattcca3'. 
This sequence contains many different inverted repeats, 
for example, 5'aggAATCGatcttaacgaagatCGATTcca3' 
or 5'aggaaTCGATCttaacgaaGATCGAttcca3'. If one 
does not consider inverted with mismatches, there is 
one maximal inverted repeats, i.e. 5'aGGAATCGATCT- 
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Finally for mirror repeats the characteristic matrix is 



FIG. 1: Secondary structure formed by an inverted repeat 
in single stranded RNA. Bases from 7 to 25 constitute the 
left arm of the stem and bases from 32 to 49 constitute the 
right arm. The loop is made by bases 26 to 31. At base 19 
there is a gap (or one-base bulge), and bases 12-14 and 42- 
44 constitute a three base mismatch (or internal loop). Note 
that base 6 is not complementary to base 50 and base 26 
is not complementary to base 31 in order to have maximal 
repeats. According to the terminology used in the paper we 
have £ = 18, m = 6, k = 3, g — 1. 



TaacgAAGATCGATTCCaS', in which the first base be- 
fore and after the structure are not complementary and 
also the first and the last base of the spacer aacg are not 
complementary. When one considers inverted or mir- 
ror repeats with mismatches the definition of maximal 
repeat is less clear and must be clearly defined (see Sec- 
tion [IVB|). In this paper we are interested in finding the 
expected number of maximal inverted and mirror repeats 
in model genome sequences. 

A repeat is characterized by the assignment of a match- 
ing rule between couples of nucleotides. For RNA se- 
quences the matching rule is defined by a 4 x 4 matrix 
whose rows and columns correspond to nucleotides A, C, 
G, and U. A matrix entry is 1 if the matching between 
the nucleotides in the row and in the column is allowed 
and zero elsewhere. For example the characteristic ma- 
trix for inverted repeats in which only Watson-Crick base 
pair (i.e. A-U and C-G) are allowed is 
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10 
10 
10 



(1) 



If the pairing G-U (or GU wobble) is allowed the matrix 
becomes 



1 
10 
10 1 
10 10 



(2) 
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(3) 



Inverted and mirror repeats can be formed both in DNA 
and in RNA. Since our results are the same for both 
nucleic acids (provided one replaces T with U) , wc decide 
to consider repeats in RNA. 

Given a matching rule, a perfect repeat of stem length 
£ exists at point x when, for a loop value m, every base 
X + 1 — i matches every base x + m + i for 1 < i < i- 
The sequence from x + 1 — £ to x will be called left arm 
of the stem, whereas the sequence from x + m + I to 
x+m+£ will be called right arm of the stem. Since we are 
interested in maximal repeats, the repeats is defined also 
by requiring that base a; -I- 1 does not match base x + m 
and base x — i does not match with base x + m + i + 1. 
We will call these repeats perfect because there are no 
bulges or mismatches. A gap or one-base bulge is present 
in the left arm of the stem if there exists an index j such 
that the above relation is true for i < j, whereas for 
i > j every base x — i matches every base x + m + i. 
The extension to bulge in the right arm of the stem is 
straightforward. Finally, a one nucleotide mismatch (or 
internal loop) is present in the stem if for some i' between 
1 and £ — 2, the base x — i' does not match with base 
x + m + i' + 1. More mismatches or a mismatch composed 
of more then one base can be present in a stem. Inverted 
repeats are known to be able to create hairpin structures 
in single strand nucleic acids. Figure [T] shows an example 
of hairpin structure formed by an inverted repeat with a 
bulge and a three base mismatch. The caption should 
help the reader in understanding the terminology used 
in this paper. 

The purpose of this paper is to derive the expected 
number of repeats of a given type for simple models of 
nucleotide sequences. We shall indicate with N{£, m, k, g) 
the expected number of repeats of stem of length i, loop 
of length m, k one-nucleotide mismatches and g gaps. 
The calculation of the expected number of repeats is com- 
plex for two main reason. The first problem is to compute 
the probability Tr(£,m, k, g) that a given short sequence 
generated according to a sequence model can host a re- 
peat with given characteristics. Once this probability 
is known the next problem is to estimate the expected 
number of repeats observed in a long sequence (genome) 
composed by N nucleotides. If the occurrence of different 
structures were independent one from the other the ex- 
pected number is simply N{£,m,k, g) = NT:(£,m, k, g). 
Unfortunately, in general the occurrence of a given struc- 
ture is not independent of the presence of another struc- 
ture. In the statistical search of simple words in genomes 
this is a known problem (see for example p^). However, 
since we search for maximal repeats and the structure 
we are interested in are long and complex, we neglect the 
problem of non independence. In all the cases considered 
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below we have performed extensive numerical simulations 
to test our formulas and, indirectly, the independence as- 
sumption. By performing careful statistical tests (usually 
tests) we cannot reject the hypothesis that our formu- 
las are correct. For this reason in the following we present 
the formulas for A^(^, m, k, g) rather than for 7r(^, to, k, g). 

III. MODELS FOR NUCLEOTIDE SEQUENCES 

A. Independent Identically Distributed Sequences 

The simplest model for nucleotide sequences is the in- 
dependent identically distributed (i.i.d.) model. In this 
model one assumes independent nucleotides with proba- 
bilities Pa, Pc, Pg, and p„, such that Pa+Pc+Pg+Pu = 

1. Although it is known that correlation between nu- 
cleotides are significant, this model allows exact analyti- 
cal calculations and can be used as a useful starting point. 

It is useful to define the probability vector p'^ = 
{pa,Pc,Pg,Pu) where the elements are the nucleotide 
probabilities. Given a type of structures characterized 
by the matrix M we introduce the scalar quantity 

q = p^Mp. (4) 

For example, inverted repeats have q = 2paPu + '^PcPg, 
whereas for mirror repeats 9 = Pa + + + Pu- 

B. Markov models 

A better class of models for nucleotide sequences is 
the class of Markov processes. Let us consider for conve- 
nience the infinite sequence X^, where i G Z and Z is the 
set of integers. An ergodic stationary m-th order Markov 
chain is characterized by the transition matrix 

p(am+i|ai, flm) (5) 

= P(Xi = arn+l\Xi^i = a,n, Xi^rn ~ 0,l) ■ 

The simplest Markov chain we shall consider exten- 
sively in the following is the 1-st order Markov chain. 
This type of processes is characterized by the 4x4 tran- 
sition matrix p{a2\ai). By taking powers of this matrix 
one can also define the fc-step transition matrix whose 
elements are pk{b\a) = P{Xi — b\Xi^k — a)- In this 
notation p(a2 1 ai) =pi(a2|ai). 

The model parameters, i.e. the order of the Markov 
chain and the transition probabilities, of a real sequence 
can be estimated by the maximum-likelihood method 
(see for example [l9j). 

C. Long memory models 

In recent years it has been proposed that parts of real 
genomes are not well described by Markovian models. 



but rather that a long memory (or long-range) process 
describes better the correlation properties of nucleotide 
sequences [13, HH [H, [H, [111- There are several ways 
of detecting and modeling correlation properties of nu- 
cleotide sequences. The approach we will follow is called 
"DNA walk" [2^ and consists in mapping the nucleotide 
sequence in a one-dimensional random walk x. Since 
there are 4 different residues in a RNA sequence while the 
random walk has two possible directions (Aa; = ±1), one 
needs to choose a mapping rule from the 4 residues to the 
2 directions. Several different mapping rules have been 
introduced (24| . In the present paper we consider two im- 
portant rules: (i) the purine-pyrimidine rule (or RY rule) 
which assigns Ax = +1 if the residue is a purine (A or G) 
and Ax = — 1 if the residue is a pyrimidine (C or U) and 
(ii) the hydrogen bond energy rule (or SW rule) which 
assigns Ax = +1 for strongly bonded residues (C or G) 
and assigns Ax = — 1 for weakly bonded residues (A or 
U) . This second rules can be useful to take into accounts 
the isochore structure of genome [1^ . By using either of 
these rules it has been observed that in most cases non- 
coding DNA sequences, i.e. DNA sequences not coding 
for proteins, display long-memory properties of the cor- 
responding DNA walk. We remind that a long-memory 
process is a process whose autocorrelation function of 
Axi decays in time as Corr[Axi+r Axi] ~ where 
< 7 < 1. Long memory processes are an important 
class of stochastic process that have found application in 
many different fields [1^. The autocorrelation function 
of a long memory process is not integrable in r between 
and +00 and, as a consequence, the process does not 
have a typical time scale. Long memory processes are 
better characterized by the Hurst exponent H that, for 
long memory processes, is if = 1 — 7/2. Thus for long- 
memory processes 1 /2 < < 1 . 

Long memory properties of nucleotide sequences has 
been associated to different genome characteristics in- 
cluding nucleosomal structure in eukaryotes [131, to the 
presence of isochores [1^ and to the presence of tan- 
dem repeats [28l|. More recently it has been suggested 
that in some genomes (for example, human) the corre- 
lation properties of DNA cannot be captured by a sin- 
gle Hurst exponent, but rather that the Hurst exponent 
may depend on the observation scale [1^, [s^l • Different 
scales can be associated with different biological struc- 
tures (genes, transposable elements, isochores). 



IV. INVERTED AND MIRROR REPEATS IN 
IID SEQUENCES 

A. Perfect repeats 

The expected number of perfect repeats of stem length 
I and loop length to in a i.i.d. genome of length N char- 
acterized by the parameter q is 

7V(^,m) = iV(l-g)"g^ (6) 
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where the exponent a is equal to 1 for m < 1 and is equal 
to 2 for m > 2. In other words we need to impose that 
the £ bases of the left arm of the stem match with the 
corresponding bases in the right arm. Moreover we need 
to impose that the first couple of bases in the loop does 
not match, such as the first couple of bases at the end of 
the stem. When the loop is shorter that 2 nucleotides one 
cannot impose that the first couple of bases in the loop 
does not match and this explains the different value of 
the exponent a. Since in a i.i.d. sequence the occurrences 
of nucleotide are independent probabilities factorize and 
Eq. [S] is obtained. This expression has been used, for 
example, in Ref. (l^ to investigate the number of perfect 
inverted repeats in bacterial genomes. 



B. Inverted with mismatches 

A mismatch in a repeat is the presence of a pair of 
nucleotides in the stem that do not match. We indicate 
with k the number of mismatches in the stem and we 
look for an expression for N{£, m, k). We prove that the 
expected number is 



N{l,m,k) = N 



£-2 
k 



(1 - (7)"+^g^"^ 



(7) 



where the exponent a assumes the same values as in 
Eq. [6l In fact a mismatch can be present only in one 
of the £ — 2 internal nucleotide of the stem (i.e. from 
the second to the {£ — l)-th nucleotide). There are (^^■^) 
ways of placing k mismatches in £ — 2 internal bases of 
the stem. 

One of the problem of Eq. [7] is the fact that, for ex- 
ample, a repeat with one mismatch can also be seen as 
a repeat with zero mismatches and a shorter stem. We 
shall denote these two repeats as embedded. One is usu- 
ally interested in counting more embedded repeats only 
once. Moreover programs designed for the search of in- 
verted repeats, such as palindrome of the EMBOSS pack- 
age [lEj l , effectively count embedded inverted repeats only 
once. Therefore we need a formula for non embedded re- 
peats. Clearly any repeat with, say, zero mismatches can 
be thought as part of a longer repeat with a large num- 
ber of mismatches. In other words we need to introduce 
an upper value of the number of mismatches, in order 
to find an expression for non embedded repeats up to a 
chosen value of the number of possible mismatches. For 
example we can ask for the expected number of inverted 
repeats with zero mismatches that cannot be seen as part 
of longer inverted repeats with one mismatch. This of 
course does not guarantee that the found repeats cannot 
be part of repeats with two mismatches. From an oper- 
ative point of view, this corresponds to run the search 
program (for example palindrome) with a maximal num- 
ber of mismatches equal to A:. Therefore a quantity more 
meaningful than Eq.([7]) is N^''\£,m,k), which is the ex- 
pected number of repeats of stem length £, loop length 
TO, and k mismatches, that cannot be part of a longer 



repeat of the same type with at most k mismatches. By 
definition k > k. The two expressions of Eq.sjH] and [7] cor- 
respond to iV'"' {£, TO, 0) and iV^*"'' {£, m, fc), respectively. 
When fc = 1 we have 



M^^{£,m,0) = N{1 - qYq^ 

{2 for < TO < 1 

3 for 2 < TO < 3 

4 for TO > 4. 

When fc = 2 we have 

N''^\£, to, 1) = N{1 - 2)(1 - qT+\^-^ 

2 for < TO < 1 

3 for 2 < TO < 3 

4 for m > 4. 



(8) 



(9) 



and 



7V(2)(^,TO,0) = iV(l - g)"g^ 

3 for < TO < 1 

4 for 2 < TO < 3 

5 for 4 < TO < 5 

6 for TO > 6. 



a 



(10) 



The general formula is 



N^\£,m,k) = N 



£-2 



(1 - qY'+%^~'' 



(3 — {k — k) + max I 0, min( 



1 for < TO < 1 

2 for TO > 2 

TO 



i,fc-fc) , (11) 



where [x] indicates the integer part of x. 

We have performed extensive numerical simulations of 
artificial genomes and we have verified that these ex- 
pressions are correct. Specifically, we have written com- 
puter programs able to detect inverted or mirror repeats 
with the required characteristics (stem and loop length, 
mismatches, etc.). Then we have performed a test 
between the frequency of observed repeats and the fre- 
quency expected by our theory. In all cases we cannot 
reject the hypothesis that our formulas are correct. 



C. Repeats with one gap 

We consider now the case of inverted and mirror re- 
peats with one gap in the stem and no mismatches. We 
shall indicate with £ the number of links in the stem, 
since in such a structure there will be £ nucleotides in 
one branch of the stem and ^ -I- 1 in the other. The ex- 
pected number of repeats with the gap in one specific 
position is the same as for perfect repeats (see Eq. [S]), 



N{£,m,k = 0,g=l)^N{l-qrq', 



(12) 
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FIG. 2: Schematic representation of the stem of the two pos- 
sible secondary structures formed by an inverted repeat with 
stem length £ = 3 and one gap. The continuous lines indicate 
complementarity and the labels on the bases are used in the 
text. 



where the exponent a is equal to 1 for m < 1 and is 
equal to 2 for m > 2. One could think that, since there 
are £ — 1 possible positions for the gap (on one arm), the 
expected number of repeats with one gap in any position 
of one arm is simply i—1 times the value in Eq. 1121 This 
is wrong because the probability of observing the gap in 
one position is not independent from the probability of 
observing the gap in another position. To understand 
why, let us consider an inverted repeat with £ = 3 and 
one gap. As shown in Fig. [2] there are two positions for 
the gap, and the corresponding structures are indicated 
as ^1 and A2 in the figure. The probability of observing 
either Ai or A2 or both is 

P(Ai U A2) = P(Ai) + P(A2) - P(Ai n A2). (13) 

P{Ai) and P{A2) are equal to the quantity in Eg [T^ 
whereas P{Ai D A2) is the joint probability that the se- 



quence can form both structures Ai and A2. By looking 
at the figure we note that the sequence can form both 
structures ii X = Y = where the bar indicates com- 
plementarity. Thus the joint probability is 

P(^i n^2) = {l~qrq^{plpt+PaPl+plPa+PcPl) 

= (1 - qTq'q. (14) 

For inverted repeats the quantity q is the probability that 
X = Y = Z and it is equal to p^pt +PaPt + plPg +PcpI- 
Analogously for mirror repeats q is the probability that 
X = Y = Z and it is equal to + + pj? -I- . In 
conclusion, the expected number of repeats with £ — 3 
and one gap is 

N{£ = 3, TO, /c 0, .9 = 1) = N{1 - qTq^{2q - g), (15) 

which is of course different from the naive (and wrong) 
answer given by twice Eq. 1121 The generalization of this 
last formula to a generic value of £ is not straightforward 
and the derivation is reported in Appendix 1 . The result 
is 

N{£,m, = 0, 1) = 2Nq'^-^{l - g)"[(€ - l)q ~ {£ - 2)q\ 

r 1 for < TO < 1, . 
" ^ I 2 for m > 2, ^^^^ 

where the factor 2 in front q^^^ is due to fact that the 
gap can be found in one of the two arms. It is worth 
noting that for large £ the correct answer of Eq. (|16p is 
3/4 of the naive and wrong answer given by — 1 times 
the expression of Eq. (fT2|) . 



V. INVERTED AND MIRROR REPEATS IN 
FIRST ORDER MARKOV CHAINS 

We now give the expression for the expected number 
of repeats for a model sequence described by a 1-st order 
Markov chain. We consider the simpler case of the ex- 
pected number of perfect repeats with a given stem (of 
length £, as before) and a generic loop of length m > 2. 

The calculation is performed in Appendix 2 and the 
result is 



Praarkov 

ni ,n2,,...,n£ — 1 

- ELiP("-ik)p(S|"-i) ) {pni+i{nt\ni!) - J2t=iPi'>^e\y)Pm-i{y\y)p{v\n() 



(17) 



p{ni)p{ni) 

I 

where fii indicates a base matching with base n^, i.e. the complementary of Ui for inverted repeats and fii = rii 
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FIG. 3: Plots of the ratio Pmarkov{£,'m)/Piid{i,m) between 
the probabihty of observing an inverted repeat with stem 
length £ and loop length m in a Markov and in an i.i.d. 
genome as a function of the loop length m. The parame- 
ters characterizing the models are estimated by four model 
genomes, i.e. Hepatitis B virus (a), E. coli (b), Drosophila 
mitochondrion (c), and Homo mitochondrion (d). In each 
panel the curves refer to £ — 4, i = 5, and £ = 6 (from 
bottom to top). 



for mirror repeats. In Eq. [T7]p(ni) is the probability of 
occurrence of base i and p{nin2.--ni) is the probability of 
occurrence of the word nin2...ne, that for Markov chain 
is easily computable (see also Appendix 2). Even if the 
expression (|17p looks complex, the numerical summation 
is easily and quickly performed for example with simple 
programs in Mathematica. It is worth noting that the 
summation is over 4^ terms, whereas a direct calculation 
taking into account all the possible repeats would require 
to sum 42^+2+m terms. 



The functional dependence of Pmarkovi^, m) from £ and 
m are not evident by eye, such as the relative mag- 
nitude of Pmarkovii,m) and Pud{i,m) = (1 - q)"q^ 
for an i.i.d genome (see Eq.(IS])). Thus we discuss here 
these issues by considering Markov models with param- 
eters equal to the ones obtained by real genomes of 
model organisms. Specifically we shall consider four 
complete genomes: (i) the Hepatitis B virus (accession 
NC_003977, length= 3,215 bp), (ii) the Escherichia coli 
K12 genome (accession NC_000913, length=4, 639, 675 
bp), (iii) the Drosophila melanogaster mitochondrion 
(accession NC_00I709, lcngth=19, 517 bp), and (iv) 
the Homo sapiens mitochondrion (accession NC_001807, 
length=16, 571 bp). Moreover we consider inverted re- 
peats. 

We first discuss the dependence of Pmarkovif-ifn) from 
the loop length m. To this end we computed the ratio 
Pmarkovi^, ^n)/ Pud^t, m) for the stem length fixed at £ — 
4, 5, and 6. Figure [3] shows this quantity for the four 
model genomes. We see that Pmarkov{^,'ni) has a small 
dependence from m. More precisely for m larger than few 
units, Pmarkovi^i'm)/ Piidi^^'m) bccomcs independent on 
m. The loop length dependence for small values of m can 
be positive (panels a,c, and d) or negative (panel b) with 
respect to the value for large m. In all cases the ratio 
Pmarkov{i,rn)/Piid{e,m) is significantly larger than one 
and it increases with the stem length £. 

Because of the small dependence on m we can consider 
Pmarkov(f-, TTi) for large values of m as a good approxima- 
tion of the probability of observing repeats. This approx- 
imation leads to a simplification of Eq. 1171 In fact, when 
m is large one can approximate the conditional probabili- 
ties m'Eq.\Uip.m+i{ni\ne) ~p{ni) a.nd Pm-i{y\y) ^p{v)- 
Thus the probability Pmarkovi^-, fn) becomes independent 
from m and equal to 



Pmarkov 

{£,m)= ^ p{nin2...nt)p{nt...n2ni) (18) 

ni ,712 ,,. . ■ .rif — 1 

(pini) - ELiP("-i|2^)p(*I"i)) (pi^e) -T.'l=iP{nt\y)p{y)p{y\nt)^ 

p{ni)p{ni) 

I 



We can now study the dependence of Pmarkov{i,m) 
from the stem length £, by considering the cases when 
m is larger than 4 bp. Figure |4] shows the ratio 
Pmarkoviiim)/ Piid{i,'m) as a function of £ for the four 
genomes. In all cases the ratio Pmarkov{i,'m) / Piid{f-,ra) 
increases almost linearly with the stem length t. For 
£ < 10 the order of magnitude of the error made by the 
iid model in predicting the number of repeats of a Markov 
sequence ranges between few percents and 30%. 



A. A simplified model 



The fact that even for large values of m the number 
of inverted repeats expected in a Markovian genome is 
significantly larger than the number expected in an iid 
genome can be explained in a simplified model of genome 
sequence. We assume that the nucleotide alphabet is 
composed only by two symbols (instead of four), that 
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FIG. 4: Plots of the ratio Pmarkov{i,m) / Piid{(.,m) between 
the probability of observing an inverted repeat with stem 
length £ and loop length m > 5 in a Markov and in an 
i.i.d. genome as a function of the stem length £. The pa- 
rameters characterizing the models are estimated by four 
model genomes, i.e. Hepatitis B virus (empty circles), E. coU 
(empty squares), Drosophila mitochondrion (filled squares), 
and Homo mitochondrion (filled circles). 
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FIG. 5: Plots of the ratio PfcMM(^, n^)/^iiij(^, Ti) between the 
probability of observing an inverted repeat with stem length 
£ and loop length m > 5 in a fc— th order Markov and in 
an i.i.d. genome as a function of the stem length £. The 
parameters characterizing the models are estimated by four 
model genomes, i.e. Hepatitis B virus (empty circles), E. coli 
(empty squares), Drosophila mitochondrion (filled squares), 
and Homo mitochondrion (filled circles). 



the transition matrix is parameterized as 



(19) 



and that the process is stationary, so that the probability 
for the two symbols are equal to 1/2. The parameter d is 
a measure of the distance from the iid model. With this 
transition matrix, the conditional probability p(n2|7ii) is 
equal to 1/2 + (5 if rti = ?i2 and to 1/2 — 5 if ni 7^ 712. We 
shall call permanence the first case and change the second 
one. We simplify further the original model by removing 
the constraints that the repeat is maximal, i.e. the con- 
dition that the two bases before and after the repeat are 
not complementary and that the first and last base in the 
loop are not complementary. The probability of an in- 
verted repeat of stem length £ and loop length m» 1 is 
given by the product of the probability of the left part of 
the stem times probability of the right part of the stem. 
The probabilities factorize because we have assumed that 
the loop is large. Now the probability for a given word 
in the left part of the stem is 2-^{l/2 - 5Y^ {1/2 + 6^^ , 
where di is the number permanencies, whereas ^2 is the 
number of changes. Clearly \t is di + d2 = £ — 1. The 
probability for the inverted and complemented word in 
the right arm of the stem is equal, so the probability for 
a given inverted is [2'^{l/2 - 5Y^{l/2 + Sf^f . We have 
to sum this quantity over all possible words, i.e. 



P{£) 



E 

di=0 



£-1 
di 



2di 



2(f-l-di) 



U'- + 2S^ 
2 I 2 



(20) 



where the factor 2 in front of the sum comes from the fact 
there are two possible words with the same position of the 
permanencies and of the changes obtained by exchanging 
one symbol with the other. For an iid sequence the prob- 
ability for an inverted of stem length £ is Pudi^) — 2~^, 
thus the ratio is 



1 



{l + A5'f~\ (21) 



For small values of 5, i.e. for Markovian sequences not 
too different from iid ones, the binomial expansion gives 



l + (£- 1)4(^2^ 



5«1, (22) 



which is the almost linear behavior observed in Figure ID 
Thus we expect the linear behavior observed in Fig. 3] for 
the more complete model is valid for moderate value of 
the stem. 



VI. HIGHER ORDER MARKOV MODELS 

In the case of higher order Markov processes the ana- 
lytical computation of the expected number of inverted 
and mirror repeats becomes considerably more complex. 
Instead of trying to obtain complicated expression with 
difficult interpretation, we perform numerical simulations 
of higher order Markov chains and we compare the ob- 
served number of repeats with the number expected from 
the iid theory. The results of our simulations are shown 
in fig. [5] and indiate that the error made in using an 
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iid model to estimate the expected number of inverted 
repeats in a Markov chain increases with (i) the stem 
length £ and (ii) the order of the Markov process. Never- 
theless it is worth pointing out that for moderate values 
of the stem length the ratio PkAiM{i,n^)/Piidi£,'m') in- 
creases approximately linearly with i. This implies that 
in the considered range the number of inverted repeats 
in a Markovian genome is given by 

PkMM {£,m)^Akiq', (23) 

where Ak is a parameter which slowly increases with the 
order k of the Markov process. 



VII. LONG MEMORY PROCESSES 

Finally we consider the problem of estimating numer- 
ically the probability of occurrence of an inverted or a 
mirror repeat in a long-range nucleotide sequence. Since 
most of the repeats with biological role are likely to be 
find in non-coding regions of the genome which are of- 
ten composed by long memory nucleotide sequences, this 
analysis is particular relevant for application to real cases. 
We generated long memory nucleotide sequences by us- 
ing either the RY rule or the SW rule and with different 
values of the Hurst exponent H. For example to gen- 
erate a RY long memory genome we simulated a binary 
long memory process with values Xi = ±1. Then for each 
Xi = -1-1 we associated either a A or a G each with prob- 
ability 1/2 and for each 0:^ = — 1 we associated either a 
C or a U each with probability 1/2. Note that with this 
generation algorithm the simulated genomes have equal 
nucleotide frequencies, i.e. Pa = Pc = = Pt = 1/4. We 
then searched in the simulated genome for perfect repeats 
with a given stem length £ and loop length m and we com- 
pare the observed frequencies with the one expected by 
an iid genome. First of all we find that also for long mem- 
ory sequences the occurrence of inverted or mirror repeats 
is essentially independent on the value of the loop length 
m. As for the Markovian case we find a small dependence 
for very small values of m. The behavior as a function of 
the stem length £ is very different from the iid case. In 
figure [5] we plot the quantity PLMi£,Tn)/ Piidi£,'m) as a 
function of £, where Plm(^, is the observed probabil- 
ity of inverted repeats in the long memory sequence. The 
left panel shows the RY (or purine-pyrimidine) rule and 
the right panel shows the SW (or hydrogen bond energy) 
rule. In the RY case for £ ^ 5 there is a decrease of the 
number of inverted repeats with respect to the iid case 
whereas for £ ^ 5 the number of observed inverted re- 
peats is larger than the number expected in the iid case. 
However the value of the ratio PLM{i,iTi) / Piid{£,m) is 
never vary large. For the SW rule a different behavior 
is observed. In right panel of fig. [6] the y axis is in a 
logarithmic scale and the ratio Plm{£, m)/Pud{£, m) has 
a clear exponential dependence on £. Very large value of 
the ratio are observed showing that using the iid formula 



for long memory sequence can lead to a severe underesti- 
mation of the expected repeats. The difference observed 
between the two rules can be easily explained by recalling 
that an inverted repeats is formed when many bonds can 
be formed between complementary bases. Since in the 
SW rule the presence of, say, a C is strongly correlated 
with the presence of a G nearby, it is intuitive to under- 
stand why many more inverted repeats are observed in 
a SW than in a RY long memory genome with the same 
Hurst exponent. 

Since it is difficult to develop a theory for the number 
of repeats in a long memory genome, we try to get some 
intuition by considering the simplified model for Marko- 
vian genomes presented in section FV Al We remind that 
Eq- ED predicts that the ratio P{t) / Piid{£) depends expo- 
nentially from £ according to exp(^ ln(l -f 4(5^)) where 5 
quantifies the "distance" of the model from the iid case. 
We fitted the curves in the right panel of fig [6] with an 
exponential function and we estimated the correspond- 
ing value of 5 as a function of H . The inset of the right 
panel of figure [6] shows that to a good approximation 
6 = H — 1/2. This allows us to conjecture that the num- 
ber of inverted repeats in SW long memory sequences 
is 

PLM{e, m) = Pud{£, m) exp[£(l + 4{H - 1/2)^)] 

exp[^(l + 4(iJ- 1/2)2)]. (-24) 

For mirror repeats we find that long memory sequences 
generated according to either SW or the RY rule show a 
behavior essentially indistinguishable from the one shown 
in the right panel of Fig. [S] The reason is that both rules 
significantly increase the probability that two equal sym- 
bols are found at a short distance. As a consequence 
Eq. [M] holds also for mirror repeats according to either 
SW or RY rule. We stress again that this formula holds 
for sequences with approximately equal nucleotide fre- 
quencies. In conclusion, differently from the Markov case, 
the exponential behavior of P{£)/Piid{£) expected from 
the simplified model is observable in long memory se- 
quences also for small values of £. This is very important 
because it means that when the sequence is long memory 
(as in many non coding sequences) the expected number 
of repeats can be significantly larger than the number 
expected in an iid sequence. The discrepancy between 
iid and long memory models increases very quickly with 
H — 1/2. Many regions of real genomes can have very 
large values of H. For example, parts of the human chro- 
mosome 22 have an estimated Hurst exponent H = 0.88 
[3l| . In these cases a careful modeling of the nucleotide 
sequence is very important in estimating the expected 
number of repeats. 



VIII. CONCLUSIONS 

In conclusion we have developed many analytical and 
numerical results for the expected number of inverted 
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1 (stem length) 



FIG. 6; Plots of the ratio PLM{l,'m) / Piid{l,Tn) between the 
probability of observing an inverted repeat with stem length 
£ and loop length m > 5 in a long memory and in an i.i.d. 
genome as a function of the stem length I and of the Hurst 
exponent H . Panel (a) shows the RY (or purine-pyrimidine) 
rule and panel (b) shows the SW (or hydrogen bond energy) 
rule. The inset of panel (b) shows the fitted 5 (see text) 
as a function of H. The dashed line is the function 5 — 
H — 1/2. For each value of H we simulated an artificial genome 
of length 10* bp. 



and mirror repeats with different features (stem length, 
loop length, presence of mismatches or gap) under the 
assumption that the investigated sequence can be mod- 
eled with different types of sequence models. In general 
the computation of the number of repeats in model se- 
quences is a complicated problems due to combinatorial 
difficulties, non independence of different occurrences (as 
in the case of gaps), and difficulties related to the se- 
quence model (as for higher order Markov process and 
long memory sequences). To the best of our knowledge 
this is the most comprehensive study of the occurrence 
of inverted and mirror repeats in model sequences. A 
careful estimation of the expected number of repeats in 
a model sequence is crucial when the investigation of a 
real sequence displays the presence of an high number 
of repeats. Is this high number expected under some 
realistic hypothesis of the sequence model? Without a 



clear answer to this question it is very difficult to assess 
if the number of repeats observed in the real sequence 
has a potential biological role because the repeats are 
over-represented. The set of results we have obtained in 
this paper could usefully complement the repeat search 
algorithms to give a measure of the significance of the 
number detected occurrences. 
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IX. APPENDIX 1 

In this appendix we derive Eq. [16] for the number of 
repeats with stem length i and one gap. 

There are £ — 1 possible positions for the gap in one 
arm. Let us call Ai, {i = 1, ..,£ — 1) the set of structures 
in which the gap has the i-th position (see Fig. [16] for the 
case ^ = 3) . This ensemble of sets has the property that 
for any set of indices «i < Z2 < ... < ik it is 

P{A,, nA,,n...nA,,) = p{A^, nA,J. (25) 

In fact if the sequence under consideration can form a 
structure with the gap both in the ii and the ik posi- 
tion, then it can form the structure with the gap in any 
intermediate position. 

We state the following theorem. 

Theorem Given an ensemble of sets Ai, A2, A^f 
satisfying the property it holds 

N N-1 

p(Ai u u .... uAn)^J2 - -^(^^ ^ 

(26) 

In order to prove this theorem we need a lemma. 
Lemma Under the above hypothesis ([25]) . it is 

p[u:[Li(A,n^„+i)] = p(A„nA.+i). (27) 

In fact 



p[ur=i(A, n A„+i)] = p{[uiZiiA, n An+i)] u [A„ n A„+i]} 
= ^[[u:ri'(A: n A„+i)] + P[A„ n A„+i] - P{[UlZi{A, n A„+i)] n [A„ n A„+i]} 
= p[[u^-ii(A, n A„+i)] + P[A„ n A„+i] - p{ur=i'[(A, n A„+i) n (A„ n A„+i)]}, (28) 

I 



where we have used the inclusion-exclusion principle. By using twice the property (|25p we can rewrite 
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P[\J^il{A, n ^„+i)] + P[An n ^„+i] - P{u^j-ii[(A, n A,+i n ... n A„ n A^+i) n (A„ n A„+i)]} = 
P[u^ri\A,; n A„+i)] + p[An n A„+i] - F{u^ri\A, n A,+i n ... n A„ n A„+i)} = 
F[u^ri'(^^ n An+i)] + PK n An+i] - P{ur=i(A, n A„+i)} = 

= PKnA„+i], (29) 

I 



i.e. our thesis. 

We can now prove the theorem 1. We prove it by 
induction. The theorem holds for N = 2, because in 

I 



i.e. our thesis. For the benefit of the reader we note that 
in the second equivalence we use the inclusion-exclusion 
principle, in the fourth we use the lemma, and in the 
fifth we use the induction hypothesis, i.e. that the thesis 
holds for TV. 

In the case of repeats considered in the paper it is 
N — £ — 1 and P{Ai) = (1 — q)°'q^. Moreover for any i it 
is P{Ai n Ai+i) = (1 — q)"q^~^q. From these values and 
Theorem 1 (i.e. Eq.[26l), Eq. [l6]holds. 

X. APPENDIX 2 

In this section we derive the expression (jl7p for the 
expected number of perfect inverted and mirror repeats 
in a Markovian genome. 

Let us indicate the left part of the stem with nin2...ni 
and consequently the right part of the stem will be 
n^...n2rii, where the bar indicates matching accordingly 
to the type of investigated repeats. We shall also indi- 
cate with TOi, ...mm the loop and with xi {X2) the base 
before (after) the repeat. The repeats can be symboli- 
cally expressed as xinin2---n^rni...'mrnni...n2niX2- The 
probability for such a structure is 

p{xi)p{ni\xi)p{n2\ni)...p{Tni\nt)p(m2\rni) 

I 



this case Eq. (j26p is equivalent to the inclusion-exclusion 
principle. We assume that Eq. [26] holds for TV and we 
prove that it holds for + 1. In fact. 



(30) 



y.p{rn\mra)...p{ni\n2)p{x2\ni). (31) 

Since we are not interested in the specific bases in xi 
and X2 we can sum the probability in Eq. (pij) in xi and 
X2 requiring that they are not complementary (remember 
that we are looking for maximal repeats). The expression 
becomes 

p{n2\ni)...p{mi\nt)p{m2\mi) p{nt\mm).-.p{ni\n2) 

X ^ p{xi)p{ni\xi)p{x2\ni). (32) 

The sum term in Eq. ((32)) becomes 

p{xi)p(ni\xi)p{x2\ni) 

4 

= p{.ni) -^p{x)p{ni\x)p{x\ni), (33) 

x=l 

where we have used the property Yl'^x=iP^'^\v) = ^■ 

In expression l31l we need to sum over the possible loop, 
i.e. in the variables mi, ...rrim, by using the constraint 
"nil 7^ "i-m- We sum first over the internal bases of the 
loop 7712, "Tim^i obtaining 



Pi^lV^i) - P(uiIiA, U An+i) = P(uiIiA,:) + P{An+i) - P[{^liA,) n A^+i] = 
Pi^tiA,) + PiAN+i) - P[uiIi(A, n Am+i)] = PiufLiA,) + PiAM+i) - PiAN n Am+i) - 



N 



N-1 



J2 P{A^) ~ ^ ^'+1) + Pi^N+l) - PiAN n An+i) = 



N+1 



N 



^P(AO-^P(A,nA,+i) 



p{mi\ni)p(ni\mrn) ^ p(TO2|"ii)p(m3|m2) p(m„|m,„_i) = p(mi|n<>)p,„_i(m„j|mi)p(n£|m„j), (34) 

7712 ,...771^71 — 1 
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where Pkib\a) is the fc— step transition probabihty, i.e. 
the probabihty of having the symbol b conditioned to 
the fact that k step before the symbol was a. For Markov 
chain the fc— step transition probability matrix is easily 
obtained as the fc— th power of the one step transition 
probability matrix. In obtaining the equation l34l we have 
use the Chapman-Kolmogorov equation, that in its sim- 
pler form is J2l^iP{y\z)p{z\x) =p2{y\x). 

Last we need to sum the expression l34l over the vari- 
ables nil and by imposing that they are not com- 

I 



that can be simplified by noting that 
p{ni)p(n2\ni)...p{ni\ni-i) = p{nin2...ne) is the prob- 
ability of the ^— word of the left part of the stem. 
Likewise p{ng_i\hf)...p{ni\n2) — p{n£...n2ni)/p{ni) is 



plementary. By using again the Chapman-Kolmogorov 
equation we obtain 

^ p{mi\ne)p„i^i{mra\mi)p{ni\ 

4 

= Pm+i{ne\ne) - ^p{ne\y)pjn-i{y\y)p{y\ni). (35) 
By putting all the terms together we finally obtain 



(36) 



proportional to the probability of the ^— word of the 
right part of the stem. Hence the probability of a repeat 
with a specified sequence in the stem is 



I p{ni) - ^p{x)p{ni\x)p{x\fii) ] p{n2\ni)...p{ne\ni^i) 

\ x=\ 

4 

Prn+\{nii\ni) - ^p{ni\y)p„i-i{y\y)p{y\ni) p{ni_i\ni)...p{ni\n2). 

I 



(p("i) - ELiP(2;)p(»^i|a;)p(a;|ni)j (pm+i(n£|n<;) -X;^=iP("f|y)pm-i(2/|y)p(y|n^)j 

p{nin2...nt)p{nt...n2ni) — , , . (37) 

p[ni)p{ne) 



On the other hand it is easy to see that the corresponding all the transition probabilities satisfy p{x\y) = p(x), i.e. 



expression for a i.i.d sequence is 



the process has no memory and becomes i.i.d. 



Pad ^ p{nin2...ni,)p{ni,...n2ni){l - ^p{x)p{x)f . (38) 



In order to obtain the number of repeats of stem length 
£ and loop length m one needs to sum Eq. ([37]) over the 
4^ possible ^— words composing the left part of the stem, 
It is direct to show that Ea. ((37|) reduces to Eg. ((38)) when i.e. 

I 



Pmarkov 

(£,m) = ^ p{nin2.-.ni)p{ni...n2,ni) 

ni ,n2 , , . . . ,n^— 1 



(p("i) ~Y.l=iP{ni\x)p{x\ni)j {pm+i{ni\nt) - J2t,=i PMy)Pm-iiy\y)piy\ne) 

p{ni)p{ni) 

I 



(39) 



which is the result of Eq. (fT7|) . 
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